What is "jailbreaking" a large language model (LLM)?
“Jailbreaking” a large language model (an AI model that takes in some text and predicts how the text is most likely to continue), such as the models behind ChatGPT (a chatbot interface for the GPT series of large language models by OpenAI), means crafting an input that gets the model to bypass its safety training and produce output it would otherwise refuse to generate.
Examples include the “grandma locket” image jailbreak, the “Do Anything Now” (DAN) jailbreak, and jailbreaks found by automatically generating adversarial prompts.
Overall, techniques like RLHF (a method for training an AI to give desirable outputs by using human feedback as a training signal) make models more likely to refuse harmful requests, but the continued discovery of jailbreaks shows that these safeguards can be circumvented.
Further reading:
- Lakera’s Gandalf is an interactive game where you can get a feel for jailbreaking by trying to get an LLM to reveal its “password”.